ABSTRACT

We present an unsupervised algorithm for detection of moving targets in
highly dynamic scenes. These are scenes whose background is subject to
stochastic motion, due to the presence of multiple moving objects (crowds),
water, trees swaying in the wind, etc. The algorithm is inspired by
biological vision. Target detection is posed as a problem of center-surround
saliency, which aims to identify the locations of the visual field of maximal
contrast with the background. Contrast is defined in terms of both appearance
and motion dynamics, and measured using mutual information between stochastic
models, known as dynamic textures, which can account for complex motion. This
enables very robust target detection in the classes of scenes which have
traditionally proven most adverse to tracking. Extensive tests in the context
of dynamic background subtraction have shown significantly superior
performance to previous techniques.

1.  INTRODUCTION 

In a natural scene, objects of interest often move amidst complicated
backgrounds that are themselves in motion, e.g. swaying trees, moving water,
waves and rain. The visual system of animals is well adapted to recognizing
the most important moving object (referred to henceforth as the “target”) in
such scenes. In fact, this ability is central to survival, for instance, by
aiding in the identification of potential predators or prey while ignoring
unimportant motion in the background. Apart from the obvious importance in
visual systems of the biological world, target detection is extremely useful
for various computer vision applications such as object recognition in video,
activity and gesture recognition, tracking, surveillance and video analysis.
For instance, a robot or an autonomous vehicle could benefit from a module
that identifies objects approaching it amidst possibly moving backgrounds,
like dust storms, in order to do effective path planning.

However, unsupervised moving target detection, often posed as the related
problem of background subtraction, is hard to solve using conventional
techniques in computer vision (see (Sheikh & Shah, 2005) for a review).
Extracting the foreground object moving in a scene where the background
itself is dynamic is so complex that, even though background subtraction is a
classic problem in computer vision, there has been relatively little progress
for these types of scenes.

A common assumption underlying many techniques for background subtraction is
that the camera capturing the scene is static (Stauffer & Grimson, 1999;
Elgammal, Harwood, & Davis, 2000; Wren, Azarbayejani, Darrell, & Pentland,
1997; Monnet, Mittal, Paragios, & Ramesh, 2003; Tavakkoli, Nicolescu, &
Bebis, 2006). However, this assumption places severe restrictions on the
applicability of such techniques to real-world video clips, which are often
shot with hand-held cameras or even on a moving platform in the case of
autonomous vehicles. Conventional techniques to address this problem involve
explicit camera motion compensation (Jung & Sukhatme, 2004), followed by
stationary camera background subtraction techniques. But these methods are
cumbersome and require a reliable estimate of the global motion. In extreme
cases, when the background itself is highly dynamic, it may not be possible
to estimate a unique global motion.

Another disadvantage of most current approaches is that they model the
background explicitly and assume that the algorithm will initially be
presented with frames containing only the background (Monnet et al., 2003;
Stauffer & Grimson, 1999; Zivkovic, 2004). The background model is built
using this data, and regions or pixels that deviate from this model are
considered part of the target or foreground. Hence, these techniques are
supervised, and the initial phase could be thought of as training the
algorithm to learn the background parameters. The need to train such
algorithms for each scene separately limits their ability to be deployed for
automatic surveillance tasks, where manual re-training of the module to
operate in each new scene is not feasible.

A further shortcoming in typical algorithms is that they often make
unjustified assumptions on the motion characteristics of the target. For
instance, it is often assumed that the foreground moves in a consistent
direction (temporal persistence) (Wixson, 2000; Li, 2004; Bugeau & Perez,
2007), with more rapid appearance changes than the background (Sheikh & Shah,
2005). However, these assumptions are not always valid, especially when there
is egomotion. As an illustration, two consecutive frames from a video clip
shot with a moving camera are shown in Figures 1(a) and (b). The camera
panning is such that the objects of interest, viz. the two cyclists, undergo
very small motion in the image coordinates. Figure 1(c) shows the optical
flow between the two frames. The background is changing rapidly and there is
no consistent pattern of flow vectors in the foreground region. The inversion
(with respect to the stationary camera scenario) of the motion
characteristics of background (which is, in this case, fast moving and
temporally coherent) and foreground (whose motion is barely existent and
mostly random) can be a major challenge for existing background subtraction
techniques.

Fig. 1. (a) and (b) Two consecutive frames from a video clip with camera
motion, (c) the optical flow information overlaid on (a). There is no
consistent pattern of optical flow in the foreground region in the image.

To address these limitations of existing algorithms, we propose a novel
paradigm for unsupervised target detection using motion saliency. The
algorithm is based on the idea that in the absence of high-level goals (such
as explicit search for a known object) the target consists of the most
salient locations of the visual field. Salient locations in turn are those
that enable the discrimination between center and surround at that location
with smallest expected probability of error. This is formalized in a
biologically inspired framework referred to as the discriminant
center-surround hypothesis (Gao & Vasconcelos, 2005, 2007) and, by
definition, produces saliency measures that are optimal in a classification
sense. This framework can be applied to any type of stimuli and features, and
optimal saliency detectors have already been derived for various stimulus
modalities for static images, including color and orientation (Gao &
Vasconcelos, 2007). In this work, we extend the notion of discriminant
center-surround saliency to moving stimuli. By defining saliency in a
discriminant sense, we eliminate the need to separately model the background
or the target. A single model for representing the motion of a region of the
video is sufficient and the most salient moving object is simply the one that
best stands out among other objects in the video with respect to this model.
As the algorithm compares the regions against one another, it depends only on
the relative disparity between their motion characteristics, and therefore is
invariant to camera motion.

In order to extend this architecture to moving stimuli, probabilistic models
that capture the motion patterns in video are needed. In this work, we choose
dynamic textures (Doretto, Chiuso, Wu, & Soatto, 2003) to model motion due to
their versatility in modeling complex moving patterns and the rich
statistical formulations they lend themselves to. In particular, dynamic
textures provide a unified generative stochastic model for appearance as well
as motion, and these can be conveniently incorporated into a discriminant
center-surround framework.

The main contributions of this work are as follows. (a) The proposed
algorithm is completely unsupervised and does not require initial training.
This enables the algorithm to adapt to any scene without manual intervention.
(b) By modeling the video sequences using dynamic textures, saliency in
motion and appearance are both taken into account in a principled manner,
without the need to model either explicitly. The proposed discriminant motion
saliency algorithm can automatically distinguish between object and
background motion due to the distinct appearance and motion characteristics
of the two regions. (c) Finally, being a discriminant technique, the
algorithm ignores egomotion, and can handle video clips shot with a moving
camera.

The remaining sections of the paper are organized as follows: the
discriminant saliency architecture is presented in Section 2. The
representation of the target and background using dynamic texture models is
discussed in Section 3. The target detection algorithm is summarized in
Section 4. Experimental evaluation and results form Section 5.

2.  DISCRIMINANT  CENTER-SURROUND 
SALIENCY 

Discriminant saliency (Gao & Vasconcelos, 2007) is defined with respect to
two classes of stimuli: the class of stimuli of interest, and the background
or null hypothesis, consisting of stimuli that are not salient. The locations
of the visual field that can be classified, with lowest expected probability
of error, as containing stimuli of interest are denoted as salient. This is
accomplished by setting up a binary classification problem which opposes the
stimuli of interest to the null hypothesis. The saliency of each location in
the visual field is then equated to the discriminant power (expected
classification accuracy) of the visual features extracted from that location
in differentiating the two classes.

Formally, let $\mathcal{V}$ be a $d$-dimensional dataset indexed by location
vector $l \in L \subset \mathbb{R}^d$, and consider the responses to visual
stimuli of a predefined set of features $Y$ (e.g. raw pixel values, Gabor or
Fourier features), computed from $\mathcal{V}$ at all locations $l \in L$. A
classification problem opposing two classes, of class label
$C(l) \in \{0, 1\}$, is posed at location $l$. Two windows are defined: a
neighborhood $\mathcal{W}_l^1$ of $l$, which is denoted as the center, and a
surrounding annular window $\mathcal{W}_l^0$, which is denoted as the
surround. The union of the two windows is denoted the total window,
$\mathcal{W}_l = \mathcal{W}_l^0 \cup \mathcal{W}_l^1$. Let $y^{(j)}$ be the
vector of feature responses at location $j$. Features in the center,
$\{y^{(j)} \mid j \in \mathcal{W}_l^1\}$, are drawn from the class of
interest (or alternate hypothesis) $C(l) = 1$, with probability density
$p_{Y|C(l)}(y|1)$. Features in the surround,
$\{y^{(j)} \mid j \in \mathcal{W}_l^0\}$, are drawn from the null hypothesis
$C(l) = 0$, with probability density $p_{Y|C(l)}(y|0)$. An illustration of
the center-surround classification problem, for a static image, is shown in
Figure 2.

Fig. 2. Illustration of discriminant center-surround saliency. Center and
surround windows are defined around each image location, and the distribution
of a previously defined set of features $Y$ estimated from the two windows.
The saliency of the location is a measure of how disjoint the two feature
distributions are.

The saliency of location $l$, $S(l)$, is the extent to which the features $Y$
can discriminate between center and surround. This is quantified by the
mutual information between features, $Y$, and class label, $C$,

$$S(l) = I_l(Y; C) = \sum_{c=0}^{1} \int p_{Y,C(l)}(y, c) \log \frac{p_{Y,C(l)}(y, c)}{p_Y(y)\, p_{C(l)}(c)} \, dy. \tag{1}$$

This mutual information is an approximation to the expected probability of
correct classification (more precisely one minus the Bayes error rate) of the
classification problem that opposes center to surround (Vasconcelos, 2003).
So, a large value of saliency $S(l)$ implies that center and surround have a
large local feature contrast, which enables their discrimination with low
probability of error. Conversely, the locations where the classification as a
target has the smallest expected probability of error can be identified by
searching for maxima of $S(l)$. The function $S(l)$, $l \in L$, is referred
to as the saliency map of the dataset $\mathcal{V}$. It can also be written
as

$$S(l) = \sum_{c=0}^{1} p_{C(l)}(c) \int p_{Y|C(l)}(y|c) \log \frac{p_{Y|C(l)}(y|c)}{p_Y(y)} \, dy \tag{2}$$

$$\phantom{S(l)} = \sum_{c=0}^{1} p_{C(l)}(c)\, KL\left( p_{Y|C(l)}(y|c) \,\|\, p_Y(y) \right) \tag{3}$$

where

$$KL(p \,\|\, q) = \int_{\mathcal{X}} p_X(x) \log \frac{p_X(x)}{q_X(x)} \, dx$$

is the Kullback-Leibler (KL) divergence between the probability distributions
$p_X(x)$ and $q_X(x)$ (Kullback, 1968). This allows an alternative
interpretation of saliency as a measure of the average distance between the
feature distribution over each window and the average of the two
distributions. This is a measure of the (lack of) overlap between the
distributions associated with center and surround.
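
As a numerical illustration of the equivalence between the mutual information
form (1) and the averaged KL form (3), the following minimal sketch evaluates
both for discretized feature distributions. The histogram representation, the
function names and the default prior are assumptions made here for
illustration only; the algorithm in this paper uses dynamic texture densities
rather than histograms.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as 1-D arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # bins with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def _normalize(hist):
    hist = np.asarray(hist, dtype=float)
    return hist / hist.sum()

def center_surround_saliency(center_hist, surround_hist, prior_center=0.5):
    """Form (3): prior-weighted KL of each class-conditional to the marginal."""
    p1, p0 = _normalize(center_hist), _normalize(surround_hist)
    marginal = (1.0 - prior_center) * p0 + prior_center * p1
    return ((1.0 - prior_center) * kl_divergence(p0, marginal)
            + prior_center * kl_divergence(p1, marginal))

def mutual_information(center_hist, surround_hist, prior_center=0.5):
    """Form (1): mutual information computed from the joint p(y, c)."""
    p1, p0 = _normalize(center_hist), _normalize(surround_hist)
    joint = np.stack([(1.0 - prior_center) * p0, prior_center * p1])
    p_c = joint.sum(axis=1)          # class prior
    p_y = joint.sum(axis=0)          # feature marginal
    product = np.outer(p_c, p_y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))
```

Under these assumptions, identical center and surround histograms give zero
saliency, while fully disjoint histograms with equal priors give the maximum
value of log 2.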


3.  REPRESENTATION  OF  VIDEO  USING  DYNAMIC 
TEXTURES 

The discriminant saliency formulation of (1) is generic and does not vary
with the type of stimulus or features $Y$ used.

In particular, by adopting suitable models for spatiotemporal stimuli (i.e.
video), this formulation is robust enough to compute motion saliency in
highly dynamic scenes. This enables the design of powerful target detection
algorithms by simple reduction of target detection to saliency detection.
Under this formulation, the design of an algorithm capable of handling highly
dynamic scenes only requires the use, in (3), of probability models
$p_{Y|C(l)}(y|c)$ that can capture the variability associated with such video
scenes. We adopt the dynamic texture (DT) model of (Doretto et al., 2003),
due to its ability to account for this variability, while jointly modeling
the spatial and temporal characteristics of the visual stimulus in an elegant
unified stochastic framework.

A dynamic texture is an autoregressive generative model that represents the
appearance of the stimulus $y_t \in \mathbb{R}^m$ (the two-dimensional image
stimulus is first converted into a column vector of length $m$), observed at
time $t$, as a linear function of a hidden state process
$x_t \in \mathbb{R}^n$ ($n \ll m$) subject to Gaussian observation noise. The
state and appearance processes form a linear dynamical system (LDS)

$$x_t = A x_{t-1} + v_t$$
$$y_t = C x_t + w_t$$

where $A \in \mathbb{R}^{n \times n}$ is the state transition matrix,
$C \in \mathbb{R}^{m \times n}$ the observation matrix, and
$v_t \sim_{iid} N(0, Q)$ and $w_t \sim_{iid} N(0, R)$ are Gaussian state and
observation noise processes, respectively. The initial state is assumed to be
distributed as $x_1 \sim N(\mu_1, S_1)$, and the model is parameterized by

$$\Theta = (A, C, Q, R, \mu_1, S_1). \tag{5}$$
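
To make the generative model concrete, the sketch below draws one sequence
from an LDS with the parameters listed in (5). It is an illustration under
stated assumptions, not the authors' implementation: the function name
`sample_dynamic_texture` and the use of NumPy's random generator are choices
made here.

```python
import numpy as np

def sample_dynamic_texture(A, C, Q, R, mu1, S1, tau, rng=None):
    """Draw tau frames from the LDS x_t = A x_{t-1} + v_t, y_t = C x_t + w_t.

    Returns the hidden states X (n x tau) and the vectorized frames Y (m x tau).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = A.shape[0], C.shape[0]
    X = np.zeros((n, tau))
    Y = np.zeros((m, tau))
    x = rng.multivariate_normal(mu1, S1)           # x_1 ~ N(mu_1, S_1)
    for t in range(tau):
        if t > 0:
            # state transition with Gaussian state noise v_t ~ N(0, Q)
            x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
        X[:, t] = x
        # observation with Gaussian noise w_t ~ N(0, R)
        Y[:, t] = C @ x + rng.multivariate_normal(np.zeros(m), R)
    return X, Y
```

Each column of `Y` can be reshaped back to the frame size to view the
synthesized texture; for realistic frame sizes, sampling the observation
noise through a dense $m \times m$ matrix $R$ would be costly, so this form
is only practical for small patches.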


The hidden state space sequence $x_t$ is a first order Markov chain that
encodes stimulus dynamics, while $y_t$ is a linear combination of
prototypical basis functions (the columns of $C$) and encodes the appearance
component of the stimulus at time $t$. Dynamic texture modeling of a sequence
of images is illustrated in Figure 3.¹

[Figure 3 panel title: Principal components]

Fig. 3. Illustration of a dynamic texture model. The first three basis images
are shown on the left, and the corresponding state space variables plotted as
a function of time. At each time instant, a video frame is represented as a
linear combination of the basis images, with weights given by the value of
the corresponding state variable.

¹ The bottle sequence from (Zhong & Sclaroff, 2003) is used in this example.


3.1. Learning dynamic texture parameters

Given center and surround regions, DT parameters could in principle be
learned by maximum likelihood (using expectation-maximization (Shumway &
Stoffer, 1982), or N4SID (Overschee & Moor, 1994)). However, due to the high
dimensionality of video sequences, these solutions are too complex for motion
saliency. A suboptimal alternative that works well in practice (Doretto et
al., 2003) is to learn the spatial and temporal parameters separately. Given
$N$ sequences, $y^{(1)}_{1:\tau}, \ldots, y^{(N)}_{1:\tau}$, of $\tau$ frames
each (where $y^{(i)}_{1:\tau} = [y^{(i)}_1 \ldots y^{(i)}_\tau]$), sampled
from a DT, let
$Y_{1:\tau} = [y^{(1)}_{1:\tau}, \ldots, y^{(N)}_{1:\tau}] \in \mathbb{R}^{m \times N\tau}$
be the matrix composed by concatenating all sequences. If
$Y_{1:\tau} = USV^T$ is its singular value decomposition (SVD), the DT
parameters are estimated as follows,

$$C = U_{[1:n]} \quad \text{(first $n$ columns of $U$)} \tag{6}$$

$$X_{1:\tau} = C^T Y_{1:\tau} \tag{7}$$

$$A = X_{2:\tau} (X_{1:\tau-1})^{\dagger} \tag{8}$$

$$Q = \frac{1}{N(\tau - 1)} \sum_{i=1}^{N} \sum_{t=1}^{\tau-1} v_t^{(i)} (v_t^{(i)})^T \tag{9}$$

$$R = \frac{1}{N\tau} \sum_{i=1}^{N} \sum_{t=1}^{\tau} w_t^{(i)} (w_t^{(i)})^T \tag{10}$$

where $X_{1:\tau} = [x^{(1)}_{1:\tau}, \ldots, x^{(N)}_{1:\tau}]$ is the
matrix of state estimates, $M^{\dagger}$ the pseudo-inverse of $M$,
$v_t^{(i)} = x_{t+1}^{(i)} - A x_t^{(i)}$, and
$w_t^{(i)} = y_t^{(i)} - C x_t^{(i)}$, for $t \in 1 \ldots \tau$. Finally,
the initial state parameters are estimated as,

$$\mu_1 = \frac{1}{N} \sum_{i=1}^{N} x_1^{(i)} \tag{12}$$

$$S_1 = \frac{1}{N} \sum_{i=1}^{N} x_1^{(i)} (x_1^{(i)})^T - \mu_1 \mu_1^T \tag{13}$$
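
The SVD-based estimation procedure described above can be sketched in a few
lines of NumPy. This is a minimal illustration under assumptions that are not
stated in the text: the state estimates are obtained by projecting the frames
onto the learned basis, state transitions are never formed across sequence
boundaries, no mean is subtracted from the frames, and the function name
`learn_dynamic_texture` is hypothetical.

```python
import numpy as np

def learn_dynamic_texture(sequences, n):
    """Suboptimal DT learning from N sequences, each an (m x tau) array."""
    N = len(sequences)
    tau = sequences[0].shape[1]
    Y = np.hstack(sequences)                          # m x (N * tau)
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                                      # first n columns of U
    X = [C.T @ y for y in sequences]                  # per-sequence states (assumed form)
    X_next = np.hstack([x[:, 1:] for x in X])         # x_2 ... x_tau of every sequence
    X_prev = np.hstack([x[:, :-1] for x in X])        # x_1 ... x_{tau-1}
    A = X_next @ np.linalg.pinv(X_prev)               # least-squares transition matrix
    V = X_next - A @ X_prev                           # state residuals v_t
    Q = V @ V.T / (N * (tau - 1))
    W = Y - C @ np.hstack(X)                          # observation residuals w_t
    R = W @ W.T / (N * tau)                           # m x m, large for big frames
    x1 = np.stack([x[:, 0] for x in X], axis=1)       # initial states, n x N
    mu1 = x1.mean(axis=1)
    S1 = x1 @ x1.T / N - np.outer(mu1, mu1)
    return A, C, Q, R, mu1, S1
```

Because the basis is only identified up to an invertible change of
coordinates, the recovered transition matrix is comparable to a ground-truth
one through its eigenvalues rather than entry by entry.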

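Using the learned model parameters, we can compute probability distributions
over the DT. The states of a DT form a Markov process with Gaussian
conditional probability for $x_t$ given $x_{t-1}$ (for any $t$). So for
Gaussian initial state conditions, the joint distribution of the state
sequence, $x_{1:\tau} = [x_1^T \ldots x_\tau^T]^T$, is also Gaussian (Chan &
Vasconcelos, 2005)

$$p(x_{1:\tau}) = G(x_{1:\tau}, \mu, \Sigma) \tag{14}$$

where $\mu = [\mu_1^T \cdots \mu_\tau^T]^T$, with $\mu_t = A \mu_{t-1}$, and
the covariance is

$$\Sigma = [S_{t,k}]_{t,k=1}^{\tau}, \quad S_{t,t} = A S_{t-1,t-1} A^T + Q, \quad S_{t,k} = A^{t-k} S_{k,k} \ (t > k), \quad S_{k,t} = S_{t,k}^T, \tag{15}$$

with $S_{1,1} = S_1$.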


Similarly, the image sequence $y_{1:\tau}$ is distributed as

$$p(y_{1:\tau}) = G(y_{1:\tau}, \gamma, \Phi) \tag{16}$$

where $\gamma = \mathcal{C}\mu$ and
$\Phi = \mathcal{C}\Sigma\mathcal{C}^T + \mathcal{R}$, and $\mathcal{C}$ and
$\mathcal{R}$ are block diagonal matrices formed from $C$ and $R$
respectively: $\mathcal{C} = \mathrm{diag}(C, \ldots, C)$ and
$\mathcal{R} = \mathrm{diag}(R, \ldots, R)$, each with $\tau$ diagonal
blocks.

At any location $l$ of the video, the densities of (16) can be estimated from
a collection of spatio-temporal patches extracted from the center and
surround windows. The evaluation of the KL divergence between DTs is needed
for the computation of $S(l)$ with (3).
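
For small patches, the Gaussian over a whole image sequence can be assembled
explicitly from the DT parameters. The sketch below is an illustration only:
it relies on the standard LDS identities $\mu_t = A\mu_{t-1}$,
$\mathrm{Cov}(x_t) = A\,\mathrm{Cov}(x_{t-1})A^T + Q$ and
$\mathrm{Cov}(x_t, x_k) = A^{t-k}\mathrm{Cov}(x_k)$ for $t \geq k$, and the
function name `sequence_gaussian` is hypothetical.

```python
import numpy as np

def sequence_gaussian(A, C, Q, R, mu1, S1, tau):
    """Mean and covariance of the stacked frames [y_1; ...; y_tau] of a DT."""
    n = A.shape[0]
    means, covs = [mu1], [S1]
    for _ in range(1, tau):
        means.append(A @ means[-1])                 # mean of x_t
        covs.append(A @ covs[-1] @ A.T + Q)         # covariance of x_t
    mu = np.concatenate(means)
    Sigma = np.zeros((n * tau, n * tau))
    for k in range(tau):
        block = covs[k]                             # Cov(x_t, x_k), starting at t = k
        for t in range(k, tau):
            Sigma[t * n:(t + 1) * n, k * n:(k + 1) * n] = block
            Sigma[k * n:(k + 1) * n, t * n:(t + 1) * n] = block.T
            block = A @ block                       # advance to Cov(x_{t+1}, x_k)
    C_blk = np.kron(np.eye(tau), C)                 # block diagonal copies of C
    R_blk = np.kron(np.eye(tau), R)                 # block diagonal copies of R
    gamma = C_blk @ mu
    Phi = C_blk @ Sigma @ C_blk.T + R_blk
    return gamma, Phi
```

The resulting covariance has size $m\tau \times m\tau$, which is why this
explicit construction is only useful for checking small cases.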

[Figure 4 panel labels: surround: $p_{Y|C(l)}(y^{(\tau)}|0)$; center:
$p_{Y|C(l)}(y^{(\tau)}|1)$; marginal: $p_Y(y^{(\tau)})$]

Fig. 4. Illustration of the center and surround windows used to compute the
saliency of location $l$. Conditional distributions are learned from the
center and surround window, while the marginal distribution is learned from
the total window. The saliency measure $S(l)$ is finally computed with (3).

Algorithm 1 Target Detection via Computation of Motion Saliency

  Input: Given video $\mathcal{V}$ indexed by location vector
    $l \in L \subset \mathbb{R}^3$, state-space dimension $n$, center window
    size $n_c$, patch size $n_p$, temporal window $\tau$.
  for $l \in L$ do
    Identify center $\mathcal{W}_l^1$ and surround $\mathcal{W}_l^0$.
    List all overlapping patches of size $n_p \times n_p \times \tau$ in
      $\mathcal{W}_l^1$ and $\mathcal{W}_l^0$.
    From the patches learn dynamic texture parameters for center
      $\Theta_1(l)$, surround $\Theta_0(l)$ and the total $\Theta(l)$ using
      (6)-(13).
    Compute the mutual information, $S(l)$, between class-conditional and
      total densities (3), using the recursive implementation of (17).
  end for
  Choose threshold value $T$. Find the target locations
    $l_{target} = \{l \in L : S(l) > T\}$.
  Output: Target locations $l_{target}$

Let $p_0(y_{1:\tau})$ and $p_1(y_{1:\tau})$ be the probabilities of a
sequence of $\tau$ frames under two DTs parameterized by $\Theta_0$ and
$\Theta_1$, corresponding to the surround and the center respectively. For
Gaussian $p_0$ and $p_1$, the KL divergence has the closed-form (Cover &
Thomas, 1991):

$$KL(p_0 \,\|\, p_1) = \frac{1}{2} \left[ \log \frac{|\Phi_1|}{|\Phi_0|} + \mathrm{tr}\left(\Phi_1^{-1} \Phi_0\right) + \|\gamma_0 - \gamma_1\|^2_{\Phi_1} - m\tau \right] \tag{17}$$

where $\|z\|^2_{\Phi} = z^T \Phi^{-1} z$ and $m$ is the number of pixels in
each frame. Direct evaluation of the KL is computationally intractable, since
the expression depends on $\Phi_0$ and $\Phi_1$, which are very large
covariance matrices. An efficient recursive procedure is, however, available
(Chan & Vasconcelos, 2004).
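
For intuition, the closed form (17) can be evaluated directly when the
sequence dimension $m\tau$ is small. The sketch below does exactly that and
is not the recursive procedure of (Chan & Vasconcelos, 2004); the function
names and the helper that combines the two KL terms with equal priors, as in
(3), are assumptions made here for illustration.

```python
import numpy as np

def gaussian_kl(gamma0, Phi0, gamma1, Phi1):
    """KL(p0 || p1) between Gaussians N(gamma0, Phi0) and N(gamma1, Phi1)."""
    dim = gamma0.shape[0]                            # equals m * tau for a DT
    _, logdet0 = np.linalg.slogdet(Phi0)
    _, logdet1 = np.linalg.slogdet(Phi1)
    diff = gamma0 - gamma1
    trace_term = np.trace(np.linalg.solve(Phi1, Phi0))   # tr(Phi1^-1 Phi0)
    mahalanobis = diff @ np.linalg.solve(Phi1, diff)     # diff^T Phi1^-1 diff
    return 0.5 * (logdet1 - logdet0 + trace_term + mahalanobis - dim)

def saliency_from_gaussians(surround, center, total, prior_center=0.5):
    """S(l) as the prior-weighted KL of surround and center to the total density.

    Each argument is a (mean, covariance) pair for the corresponding window.
    """
    kl_surround = gaussian_kl(*surround, *total)
    kl_center = gaussian_kl(*center, *total)
    return (1.0 - prior_center) * kl_surround + prior_center * kl_center
```

Using `slogdet` and `solve` avoids forming explicit determinants and
inverses, but the cost still grows cubically with $m\tau$, which illustrates
why a recursive evaluation is needed for realistic patch sizes.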

4.  TARGET  DETECTION  ALGORITHM

In an unsupervised task, in the absence of any information regarding a
specific previously known target, motion saliency provides an objective way
of defining the target: the target consists of those regions of the video
with high motion saliency. So, moving target detection is performed by first
computing the saliency measure $S(l)$ at each location $l$ of the video.

Center and surround windows are centered at the location, and a collection of
spatio-temporal patches extracted from each window. Prior probabilities for
both classes are assumed to be equal, i.e.
$p_{C(l)}(0) = p_{C(l)}(1) = 1/2$. DT parameters are then learned from the
center, surround, and total windows, to obtain the densities
$p_{Y|C(l)}(y^{(\tau)}|1)$, $p_{Y|C(l)}(y^{(\tau)}|0)$, and
$p_Y(y^{(\tau)})$, respectively. $S(l)$ is finally computed with (3), using
the recursive implementation of (17) (Chan & Vasconcelos, 2004). The
procedure is illustrated in Figure 4. Those pixels which have a saliency
value above a predetermined threshold are marked as belonging to the moving
target. The motion saliency based target detection algorithm is summarized in
Algorithm 1.
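
The final thresholding step, together with the threshold sweep used for the
ROC analysis in Section 5, can be sketched as follows. This is a minimal
illustration, assuming the saliency map and a binary ground-truth mask are
NumPy arrays of the same shape and that the mask contains both target and
background pixels; the function names and the number of thresholds are
choices made here.

```python
import numpy as np

def detect_targets(saliency_map, threshold):
    """Mark locations whose saliency exceeds the threshold as target."""
    return saliency_map > threshold

def roc_curve(saliency_map, ground_truth, num_thresholds=200):
    """False alarm and detection rates over a sweep of thresholds."""
    gt = ground_truth.astype(bool)
    thresholds = np.linspace(saliency_map.min(), saliency_map.max(), num_thresholds)
    false_alarm = np.empty(num_thresholds)
    detection = np.empty(num_thresholds)
    for i, threshold in enumerate(thresholds):
        detected = detect_targets(saliency_map, threshold)
        detection[i] = np.mean(detected[gt])       # fraction of target pixels found
        false_alarm[i] = np.mean(detected[~gt])    # fraction of background pixels flagged
    return thresholds, false_alarm, detection

def equal_error_rate(false_alarm, detection):
    """Error at the sweep point where false alarm rate is closest to miss rate."""
    miss = 1.0 - detection
    i = np.argmin(np.abs(false_alarm - miss))
    return 0.5 * (false_alarm[i] + miss[i])
```

With a finite sweep the two rates rarely coincide exactly, so the sketch
averages them at the closest point; a finer sweep or interpolation would give
a more precise estimate.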

5.  EXPERIMENTS  AND  RESULTS 

To evaluate target detection performance, the proposed algorithm was tested
on sequences collected from the web. Frames from some of these sequences are
shown in panel (a) of Figures 5-7. In all cases, the background is highly
dynamic. In addition, most sequences were shot with significant camera
motion. Figure 5 presents frames from a sequence which depicts a person
skiing in heavy snowfall. A pair of cyclists ride through a grassy plain in
Figure 6, while an aircraft landing is tracked using a moving camera in
Figure 7. Due to the extreme variability in background, these clips are
challenging for conventional foreground detection techniques.

Fig. 5. Results of target detection on a skiing clip shot with a moving
camera, with heavy snowfall in the background: (a) original (b) detected
target

To perform target detection, the sequences were converted to grayscale, and
saliency maps computed at sub-sampled locations of the video, using a grid
scaled down by a factor of 4 spatially and 2 temporally. At each grid
location, the center window occupied 16 x 16 pixels and spanned 11 frames: 5
past frames, the current frame, and 5 frames in the future.
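
The sampling scheme described above can be sketched as follows, assuming the
video is stored as a NumPy array of shape (frames, height, width). The
boundary handling (skipping locations whose center window would leave the
video) and the function names are assumptions made here; the surround window
is omitted because its size is not stated in this section.

```python
import numpy as np

def grid_locations(video_shape, spatial_step=4, temporal_step=2,
                   center_size=16, half_frames=5):
    """Sub-sampled (t, y, x) locations whose center window fits in the video."""
    num_frames, height, width = video_shape
    half = center_size // 2
    return [(t, y, x)
            for t in range(half_frames, num_frames - half_frames, temporal_step)
            for y in range(half, height - half + 1, spatial_step)
            for x in range(half, width - half + 1, spatial_step)]

def center_window(video, location, center_size=16, half_frames=5):
    """Spatio-temporal block: 5 past frames, the current frame, 5 future frames."""
    t, y, x = location
    half = center_size // 2
    return video[t - half_frames:t + half_frames + 1,
                 y - half:y + half,
                 x - half:x + half]
```

Each block returned by `center_window` has shape (11, 16, 16) with the
default arguments, matching the window dimensions quoted in the text.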

Saliency maps obtained using the proposed algorithm on the test clips are
shown in panel (b) of Figures 5, 6 and 7. Video sequences of these and
various other detection examples are available from
www.svcl.ucsd.edu/~projects/background_subtraction. Even though the
background is extremely dynamic, the relevant targets are detected accurately
in all three cases.

Fig. 6. Results of target detection on clip showing a pair of cyclists. The
camera is moving to track the cyclists, causing very large variability in the
background: (a) original (b) detected targets

Fig. 7. Results of target detection on clip showing an aircraft landing. The
camera is moving to keep the aircraft in focus, causing variability in the
background which consists of buildings, cars and trees: (a) original (b)
detected target

To enable a quantitative analysis, all sequences were manually annotated with
the groundtruth for the objects of interest. The saliency maps were then
thresholded at a large number of values, and using the groundtruth
information the false alarm rate ($\alpha$) and detection rate ($\beta$) were
computed. These were used to generate receiver operating characteristic (ROC)
curves. Using the ROC curves, the equal error rate (EER), defined as the
error at which the false alarm rate equals the miss rate
($\alpha = 1 - \beta$), was also estimated. The EER represents a quantitative
measure of target detection performance of the proposed algorithm. The low
EER (average of around 4.7%) shows that the proposed algorithm identifies the
target reliably with low false positive rate. Table 1 shows the EERs for the
three clips of Figures 5-7.

Sequence    EER
skiing      3%
cyclists    8%
landing     3%
Average     4.7%

Table 1. Equal Error Rates for the sequences tested. The proposed algorithm
has very low EER for all clips, showing that it can accurately detect the
target with very low false positive rate.


6.  CONCLUSION

In this work, we have proposed an algorithm for unsupervised moving target
detection based on center-surround saliency. The new algorithm is inspired by
biological vision, and extends a discriminant formulation of center-surround
saliency previously proposed for static imagery. By using dynamic texture
models for motion, we derive an information theoretic measure of motion
saliency. The discriminant center-surround framework, in combination with the
modeling power of dynamic textures, leads to a robust and versatile algorithm
that can be applied to scenes with highly dynamic backgrounds, even when the
camera is moving. The algorithm combines spatial and temporal components of
saliency in a principled manner. Being completely unsupervised, it does not
require any training and can thus be automatically deployed to new scenes,
with no need for manual supervision or parameter tuning. As the algorithm can
work even for moving cameras, it can also be incorporated into hand-held or
vehicle mounted sensing devices.

Potential applications for the army include automated surveillance with
alerts for specific events, detection of events in archived video, crowd
monitoring, detection of breaches of borders and other secure areas, path
planning for autonomous vehicles and automated target tracking.